An algorithm for correcting mislabeled data

Authors

  • Xinchuan Zeng
  • Tony R. Martinez
Abstract

Reliable evaluation of classifier performance depends on the quality of the data sets on which the classifiers are tested. During the collection and recording of a data set, however, noise may be introduced into the data, especially in real-world environments, which degrades the quality of the data set. In this paper, we present a novel approach, called ADE (automatic data enhancement), to correct mislabeled data in a data set. Using multi-layer neural networks trained by backpropagation as the basic framework, ADE assigns each training pattern a class probability vector as its class label, in which each component represents the probability of the corresponding class. During training, ADE continually updates the probability vector based on its difference from the output of the network. With this updating rule, the probability of a mislabeled class gradually becomes smaller while that of the correct class becomes larger, which eventually corrects the mislabeled data after a number of training epochs. We have tested ADE on a number of data sets drawn from the UCI data repository for nearest neighbor classifiers. The results show that for most data sets, when mislabeled data are present, a classifier constructed using a training set corrected by ADE achieves significantly higher accuracy than one constructed without ADE.
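The abstract does not give the exact updating rule, so the following is only a minimal sketch of the general idea: a class probability vector is nudged toward the network output each epoch until the most probable class changes. The function names, the learning rate `rate`, the linear update, and the fixed `net_output` are all assumptions made for illustration. In ADE itself the network output would change as backpropagation training proceeds.

```python
def update_label_probs(probs, net_output, rate=0.1):
    """Move a class probability vector a small step toward the network output.

    Hypothetical rule (not taken from the paper): p <- p + rate * (o - p),
    followed by clipping at zero and renormalizing so the entries sum to 1.
    """
    updated = [max(p + rate * (o - p), 0.0) for p, o in zip(probs, net_output)]
    total = sum(updated)
    return [u / total for u in updated]


def current_label(probs):
    """Return the index of the most probable class."""
    return max(range(len(probs)), key=lambda k: probs[k])


# A pattern recorded as class 0, although the (assumed) network output favors class 1.
probs = [0.9, 0.1]
net_output = [0.2, 0.8]  # held fixed here purely to keep the sketch short

for epoch in range(20):
    probs = update_label_probs(probs, net_output, rate=0.1)

corrected = current_label(probs)  # the label implied by the updated vector
```

In this toy setting the probability of the recorded class shrinks while that of the other class grows, so the implied label eventually flips, which mirrors the behavior the abstract describes.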

Similar resources

An Algorithm for Recognizing Mislabeled and Abnormal Samples in Cancer Microarray

Microarray is a high-throughput experimental technology which has been used in many life-science areas especially in medical applications. The sample classification problem is crucial for disease diagnosis and treatment. However, the process of sample labeling can be very complex and partially subjective. Existing studies confirm this phenomenon and show that even a very small number of error s...

Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy

Many machine learning datasets are noisy with a substantial number of mislabeled instances. This noise yields sub-optimal classification performance. In this paper we study a large, low quality annotated dataset, created quickly and cheaply using Amazon Mechanical Turk to crowdsource annotations. We describe computationally cheap feature weighting techniques and a novel non-linear distribution ...

A noise filtering method using neural networks - Soft Computing Techniques in Instrumentation, Measurement and Related Applications, 2003. SCIMA 20

During the data collecting and labeling process it is possible for noise to be introduced into a data set. As a result, the quality of the data set degrades and experiments and inferences derived from the data set become less reliable. In this paper we present an algorithm, called ANR (automatic noise reduction), as a filtering mechanism to identify and remove noisy data items whose classes h...

Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model

MOTIVATION Mislabeled samples often appear in gene expression profile because of the similarity of different sub-type of disease and the subjective misdiagnosis. The mislabeled samples deteriorate supervised learning procedures. The LOOE-sensitivity algorithm is an approach for mislabeled sample detection for microarray based on data perturbation. However, the failure of measuring the perturbin...

An Effective Iterative Algorithm for Modeling of Multicomponent Gas Separation in a Countercurrent Membrane Permeator

 A model is developed for separation of multicomponent gas mixtures in a countercurrent hollow fiber membrane module. While the model’s solution in countercurrent module usually involves in a time consuming iterative procedure, a proper initial guess is proposed for beginning the calculation and a simple procedure is introduced for correcting the guesses, hereby the CPU time is decreased ess...

Journal:
  • Intell. Data Anal.

Volume 5

Publication date: 2001